{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OnbjvoqHM_h5"
      },
      "source": [
        "# Create cross-validation train--val splits for elite criticism classifier training\n",
        "\n",
        "*author:* Hauke Licht\n",
        "\n",
        "In this notebook, I create the cross-validation train--val splits for finding best-performing hyper-parameters when training elite criticism classifiers.\n",
        "\n",
        "The labeled dataset I use records 5.3K+ tweets that have been sampled from tweets posted by political parties from 20 Western countries between 2008 and early 2021.\n",
        "The annotations come from 6 crowd coders per tweet that I have aggregated into tweet-level labels using a Dawid and Skene ([1979](https://doi.org/10.2307/2346806)) annotation model (cf. [Paun et al. 2018](https://aclanthology.org/Q18-1040.pdf))."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tmrIPXRXg9lC"
      },
      "source": [
        "# Setup"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8X7ISk4khCWR"
      },
      "source": [
        "Load modules:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 16,
      "metadata": {
        "id": "Y5f1fFbETSbM"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "import random\n",
        "import json\n",
        "\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "\n",
        "from sklearn.model_selection import train_test_split, StratifiedKFold"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 17,
      "metadata": {
        "id": "AYOhzvayKMq7"
      },
      "outputs": [],
      "source": [
        "base_path = os.path.join('..', '..')\n",
        "data_path = os.path.join(base_path, 'data', 'intermediate', 'training')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "S3clKYWdhY9Q"
      },
      "source": [
        "Setup for reproducibility:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 18,
      "metadata": {
        "id": "I26TKRr1hLTG"
      },
      "outputs": [],
      "source": [
        "SEED = 1234\n",
        "\n",
        "random.seed(SEED)\n",
        "np.random.seed(SEED)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XWqZ7LGMFeHV"
      },
      "source": [
        "# Data"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eQWLAKg4aSeG"
      },
      "source": [
        "## Description"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BqBm2yyeYta4"
      },
      "source": [
        "The dataset we'll load has the following columns:\n",
        "\n",
        "- `item_id` (str): Unique ID of tweet (has been constructed by concatenating ISO-3-character country code of the party posting the tweet, `user_id`, and `status_id`)\n",
        "- `user_id` (int): the ID of the account that has posted the tweet\n",
        "- `status_id` (int): the ID of the tweet\n",
        "- `labeling` (str): the label class a tweet has been assigned to (i.e., its label)\n",
        "- `text` (str): The tweet's text (in its original language)\n",
        "- `test_` (bool): Boolean flag indicating tweets that should in the test (not the training) data split\n",
        "\n",
        "Note that user and status IDs are integers because they can be very long.\n",
        "Hence, I'll read them as int64 types to ensure they are not corrputed.\n",
        "(Alternatively, you could just treat them as strings.)\n",
        "\n",
        "To this end, I create a dictionary mapping column names to the desired data types that I'll pass when reading the CSV file:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 19,
      "metadata": {
        "id": "Wrj0Mxe2am0W"
      },
      "outputs": [],
      "source": [
        "col_types = {\n",
        "  'item_id': str,\n",
        "  'user_id': 'Int64',\n",
        "  'status_id': 'Int64',\n",
        "  'labeling': str,\n",
        "  'text': str,\n",
        "  'test_': bool\n",
        "}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rP65yYzRYLsb"
      },
      "source": [
        "## Load\n",
        "\n",
        "Load and read the labeled tweets dataset:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 21,
      "metadata": {
        "id": "DsmH1rS_XKgr"
      },
      "outputs": [],
      "source": [
        "fp = os.path.join(data_path, 'training_data_pooled_samples.csv')\n",
        "dat = pd.read_csv(fp, sep = ',', dtype = col_types)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "btNyCm4RYfiq"
      },
      "source": [
        "Set unique IDs as index."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 22,
      "metadata": {
        "id": "MW-9z7OLYPpL"
      },
      "outputs": [],
      "source": [
        "dat.set_index('item_id', inplace = True)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 23,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "tJt3Y76G9W5Z",
        "outputId": "81de7004-c666-4552-bf70-7cb9e0e12dbc"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "labeling\n",
              "no              3344\n",
              "yes-general     1289\n",
              "yes-specific     768\n",
              "Name: count, dtype: int64"
            ]
          },
          "execution_count": 23,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "dat.labeling.value_counts()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "n9kI6uU8cXP6"
      },
      "source": [
        "## Create binary labels\n",
        "\n",
        "Let's have a look at the `labeling` values:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 24,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "s6gLkIe8cbi0",
        "outputId": "368f621f-c345-4cd0-ed4b-f76e9b29bc3b"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "labeling\n",
              "no              3344\n",
              "yes-general     1289\n",
              "yes-specific     768\n",
              "Name: count, dtype: int64"
            ]
          },
          "execution_count": 24,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "dat.labeling.value_counts()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dw4127o6chWY"
      },
      "source": [
        "The labelings indicates whether a tweet contains\n",
        "\n",
        "1. **no** elite criticism,\n",
        "2. elite criticism directed at **the elite in general**, or\n",
        "3. criticism of **specific elites**.\n",
        "\n",
        "We argue that *the essence of anti-elite rhetoric* (as a political strategy) is generalized elite criticism.\n",
        "Hence, we are mainly interested in the distinction between 'general' elite criticism and all other statements.\n",
        "Accordingly, I create a **binary** label indicator."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 25,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "5Et8jtZGdPM7",
        "outputId": "1be4bb4e-e319-40c7-f156-eb8965e94bd7"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "label_\n",
              "0    4112\n",
              "1    1289\n",
              "Name: count, dtype: int64"
            ]
          },
          "execution_count": 25,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "dat['label_'] = dat.labeling == 'yes-general' # positive (negative) class label => True (False)\n",
        "dat['label_'] = dat['label_'].astype(int)  # positive (negative) class label => 1 (0)\n",
        "dat.label_.value_counts()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bAMekJn7kFVU"
      },
      "source": [
        "# Split into training, validation, and test data sets"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "t4zQ6tGSb7iN"
      },
      "source": [
        "Now we can split the dataset into the training and test partitions.\n",
        "To do so, we use the `test_` indicator column that comes with the dataset:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 26,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "7LGMOnxAbwpP",
        "outputId": "8173564e-04d2-4faa-99e3-bd9e3419c66f"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "No. train samples: 4342; pos. label proportion: 0.236\n",
            "No. test samples:  1059; pos. label proportion: 0.250\n"
          ]
        }
      ],
      "source": [
        "train_dat = dat[~dat.test_]\n",
        "print(f'No. train samples: {len(train_dat)}; pos. label proportion: {train_dat.label_.values.mean():.3f}')\n",
        "test_dat = dat[dat.test_]\n",
        "print(f'No. test samples:  {len(test_dat)}; pos. label proportion: {test_dat.label_.values.mean():.3f}')"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 27,
      "metadata": {
        "id": "mbeua6Byaw8V"
      },
      "outputs": [],
      "source": [
        "# creat CV folds\n",
        "# note: shuffling but with fixed seed ensures reproducibility (see https://stackoverflow.com/a/51087273)\n",
        "kf = StratifiedKFold(n_splits = 5, shuffle=True, random_state=SEED)\n",
        "\n",
        "cv_ids = dict()\n",
        "\n",
        "for iter, (train_idxs, val_idxs) in enumerate(kf.split(train_dat, train_dat.label_.values)):\n",
        "  # save indeces\n",
        "  cv_ids[str(iter)] = dict(\n",
        "    train = train_dat.iloc[train_idxs].index.values.tolist(),\n",
        "    val = train_dat.iloc[val_idxs].index.values.tolist()\n",
        "  )"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 28,
      "metadata": {
        "id": "8y5aFEnViO4N"
      },
      "outputs": [],
      "source": [
        "with open(os.path.join(data_path, 'cv_ids.json'), 'w') as file:\n",
        "  json.dump(cv_ids, file, indent=2)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rLR23BxIyC8g"
      },
      "source": [
        "<a id='ft_native'></a>"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": [],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.12"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
